An Idiot’s guide to Support vector 
machines (SVMs) 

R. Berwick, Village Idiot 


SVMs: A New Generation of Learning Algorithms

• Pre 1980: 

- Almost all learning methods learned linear decision surfaces. 

- Linear learning methods have nice theoretical properties 

• 1980’s 

- Decision trees and NNs allowed efficient learning of non-linear
decision surfaces

- Little theoretical basis and all suffer from local minima 

• 1990’s 

- Efficient learning algorithms for non-linear functions based 
on computational learning theory developed 

- Nice theoretical properties. 





Key Ideas 

• Two independent developments within last decade 

- New efficient separation of non-linear regions using
“kernel functions”: a generalization of ‘similarity’ to
new kinds of similarity measures based on dot products

- Use of quadratic optimization problem to avoid ‘local 
minimum’ issues with neural nets 

- The resulting learning algorithm is an optimization 
algorithm rather than a greedy search 


Organization 

• Basic idea of support vector machines: just like 1-layer
or multi-layer neural nets

- Optimal hyperplane for linearly separable 
patterns 

- Extend to patterns that are not linearly 
separable by transformations of original data to 
map into new space - the Kernel function 

• SVM algorithm for pattern recognition 





Support Vectors 


• Support vectors are the data points that lie closest 
to the decision surface (or hyperplane) 

• They are the data points most difficult to classify 

• They have direct bearing on the optimum location 
of the decision surface 

• We can show that the optimal hyperplane stems
from the function class with the lowest
“capacity” = # of independent features/parameters
we can twiddle [note this is ‘extra’ material not
covered in the lectures... you don’t have to know
this]


Recall from 1-layer nets: Which Separating Hyperplane?




In general, lots of possible 
solutions for a,b,c (an 
infinite number!) 

Support Vector Machine 
(SVM) finds an optimal 
solution 










Support Vector Machine (SVM) 

• SVMs maximize the margin (Winston terminology: the ‘street’)
around the separating hyperplane.

• The decision function is fully specified by a (usually very small)
subset of training samples, the support vectors.

• This becomes a quadratic programming problem that is easy
to solve by standard methods

[Figure: two classes separated by a hyperplane; labels: "Support vectors", "Maximize margin".]


Separation by Hyperplanes 

• Assume linear separability for now (we will relax this 
later) 

• in 2 dimensions, can separate by a line 

- in higher dimensions, need hyperplanes 







General input/output for SVMs just like for
neural nets, but for one important addition...

Input: set of (input, output) training pair samples; call the
input sample features $x_1, x_2, \ldots, x_n$, and the output result $y$.
Typically, there can be lots of input features $x_i$.

Output: set of weights $w$ (or $w_i$), one for each feature,
whose linear combination predicts the value of $y$. (So far,
just like neural nets...)

Important difference: we use the optimization of maximizing
the margin (‘street width’) to reduce the number of weights
that are nonzero to just a few that correspond to the
important features that ‘matter’ in deciding the separating
line (hyperplane)... these nonzero weights correspond to the
support vectors (because they ‘support’ the separating
hyperplane)














Which Hyperplane to pick? 

• Lots of possible solutions for a,b,c. 

• Some methods find a separating 
hyperplane, but not the optimal one (e.g., 
neural net) 

• But: Which points should influence 
optimality? 

- All points? 

• Linear regression 

• Neural nets 

- Or only “difficult points” close to 
decision boundary 

• Support vector machines 



Support Vectors again for linearly separable case 

• Support vectors are the elements of the training set that 
would change the position of the dividing hyperplane if 
removed. 

• Support vectors are the critical elements of the training set 

• The problem of finding the optimal hyperplane is an
optimization problem and can be solved by optimization
techniques (we use Lagrange multipliers to get this
problem into a form that can be solved analytically).

Definitions

Define the hyperplanes H such that:
$w \cdot x_i + b \ge +1$ when $y_i = +1$
$w \cdot x_i + b \le -1$ when $y_i = -1$

$H_1$ and $H_2$ are the planes:

$H_1$: $w \cdot x_i + b = +1$

$H_2$: $w \cdot x_i + b = -1$

The points on the planes $H_1$ and $H_2$ are the tips of the support vectors.

The plane $H_0$ is the median in between, where $w \cdot x_i + b = 0$.

d+ = the shortest distance to the closest positive point
d- = the shortest distance to the closest negative point
The margin (gutter) of a separating hyperplane is d+ + d-



Moving the other vectors 
has no effect 


Moving a support vector 
moves the decision 
boundary 



The optimization algorithm to generate the weights proceeds in such a 
way that only the support vectors determine the weights and thus the 
boundary 













Defining the separating Hyperplane 

• The decision surface separating the classes is a hyperplane
of the form:

$w^T x + b = 0$

- w is a weight vector

- x is input vector

- b is bias

• Allows us to write (here $d_i$ is the desired output, written $y_i$ elsewhere)

$w^T x + b > 0$ for $d_i = +1$
$w^T x + b < 0$ for $d_i = -1$


Some final definitions 

• Margin of Separation (d): the separation between the 
hyperplane and the closest data point for a given weight 
vector w and bias b. 

• Optimal Hyperplane (maximal margin): the particular 
hyperplane for which the margin of separation d is 
maximized. 





Maximizing the margin (aka street width) 

We want a classifier (linear separator) 
with as big a margin as possible. 


Recall the distance from a point $(x_0, y_0)$ to a line $Ax + By + c = 0$ is $|Ax_0 + By_0 + c| / \sqrt{A^2 + B^2}$, so

the distance between $H_0$ and $H_1$ is then $|w \cdot x + b| / \|w\| = 1/\|w\|$, so

the total distance between $H_1$ and $H_2$ is thus $2/\|w\|$.

In order to maximize the margin, we thus need to minimize $\|w\|$, with the
condition that there are no datapoints between $H_1$ and $H_2$:

$x_i \cdot w + b \ge +1$ when $y_i = +1$
$x_i \cdot w + b \le -1$ when $y_i = -1$

These can be combined into: $y_i (x_i \cdot w + b) \ge 1$
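
The margin formulas above can be checked numerically. The following is a minimal sketch, assuming a hand-picked weight vector w = (0.5, 0.5), bias b = 0, and four made-up training points; none of these values come from the slides.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))

def margin_width(w):
    """Distance between H1 (w.x + b = +1) and H2 (w.x + b = -1)."""
    return 2.0 / math.sqrt(dot(w, w))

def satisfies_margin(w, b, points, labels):
    """True if every point obeys y_i (x_i.w + b) >= 1."""
    return all(y * (dot(w, x) + b) >= 1.0 for x, y in zip(points, labels))

# Hypothetical separator and data, chosen only for illustration.
w, b = [0.5, 0.5], 0.0
points = [[1.0, 1.0], [2.0, 3.0], [-1.0, -1.0], [-3.0, -2.0]]
labels = [+1, +1, -1, -1]

print("street width 2/||w||:", margin_width(w))
print("distance of (1, 1) to H0:", distance_to_hyperplane(w, b, points[0]))
print("no points inside the street:", satisfies_margin(w, b, points, labels))
```

In this example the point (1, 1) lies exactly on H1, so its distance to H0 is half the street width.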



We now must solve a quadratic 
programming problem 

• Problem is: minimize $\|w\|$, s.t. discrimination boundary is
obeyed, i.e., $\min f(x)$ s.t. $g(x) = 0$, which we can rewrite as:
$\min f: \frac{1}{2}\|w\|^2$ (Note this is a quadratic function)
s.t. $g: y_i(w \cdot x_i + b) = 1$ or $[y_i(w \cdot x_i + b)] - 1 = 0$

This is a constrained optimization problem

It can be solved by the Lagrange multiplier method

Because it is quadratic, the surface is a paraboloid, with just a
single global minimum (thus avoiding a problem we had
with neural nets!)

[Figure: flattened paraboloid $f(x, y) = 2 + x^2 + 2y^2$ with superimposed constraint $g: x + y = 1$.]

The minimum is reached where the constraint line g (shown in green)
is tangent to the inner ellipse contour line of f (shown in red);
note the direction of the gradient arrows.

At the tangent solution p, the gradient vectors of f and g are
parallel (there is no possible move that improves f while also
keeping you on the constraint g).



Two constraints 

1. Parallel normal constraint (= gradient constraint
on f, g s.t. solution is a max, or a min)

2. g(x) = 0 (solution is on the constraint line as well)

We now recast these by combining f, g as the new
Lagrangian function by introducing new ‘slack
variables’ (the Lagrange multipliers), denoted a (or,
more usually in the literature, $\alpha$)
























Redescribing these conditions 

• Want to look for solution point p where

$\nabla f(p) = a \nabla g(p)$

$g(x) = 0$

(a is the multiplier, often written $\lambda$)

• Or, combining these two as the Lagrangian L and
requiring that the derivative of L be zero:

$L(x, a) = f(x) - a\, g(x)$

$\nabla L(x, a) = 0$


At a solution p 

• The constraint line g and the contour lines of f must
be tangent

• If they are tangent, their gradient vectors
(perpendiculars) are parallel

• The gradient of g is perpendicular to the constraint line
g = 0 (it points in the direction of steepest ascent of g)

• The gradient of f must also lie along the same direction as the gradient of g
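
The tangency conditions can be worked through on a small example. The sketch below assumes the paraboloid f(x, y) = 2 + x^2 + 2y^2 and the constraint x + y = 1 (the function is a reading of the figure caption, so treat it as an assumption). It solves the stationarity equations by hand, checks that the two gradients are parallel, and compares against a brute-force scan along the constraint line.

```python
def f(x, y):
    """Assumed example objective: a flattened paraboloid."""
    return 2 + x**2 + 2 * y**2

def grad_f(x, y):
    """Gradient of f."""
    return (2 * x, 4 * y)

GRAD_G = (1.0, 1.0)  # gradient of g(x, y) = x + y - 1

# Stationarity of L = f - a*g gives 2x = a and 4y = a.
# Substituting into the constraint x + y = 1 gives a/2 + a/4 = 1.
a = 1.0 / (0.5 + 0.25)
x_star, y_star = a / 2, a / 4

# Parallel gradients: the 2-D cross product of grad f and grad g is zero.
gx, gy = grad_f(x_star, y_star)
cross = gx * GRAD_G[1] - gy * GRAD_G[0]

# Brute-force check: scan points (t, 1 - t) on the constraint line.
grid = [k / 1000 for k in range(-2000, 3001)]
t_best = min(grid, key=lambda t: f(t, 1 - t))

print("solution p:", (x_star, y_star), "multiplier a:", a)
print("cross product of gradients:", cross)
print("grid minimiser along the line:", t_best)
```

The grid search only confirms the answer to the grid resolution; the Lagrangian conditions give it exactly.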





How the Lagrangian solves constrained
optimization

$L(x, a) = f(x) - a\, g(x)$ where
$\nabla L(x, a) = 0$

Partial derivatives wrt x recover the parallel normal
constraint

Partial derivatives wrt a recover the constraint g(x) = 0

In general,

$L(x, a) = f(x) + \sum_i a_i g_i(x)$

(the f term carries the gradient/min of f; the sum carries the constraint conditions g)

L is a function of n + m variables: n for the x's, m for the a's. Differentiating gives n + m equations, each
set to 0. The n eqns differentiated wrt each $x_i$ give the gradient conditions;
the m eqns differentiated wrt each $a_i$ recover the constraints $g_i$.


In our case, $f(x)$: $\frac{1}{2}\|w\|^2$; $g(x)$: $y_i(w \cdot x_i + b) - 1 = 0$, so the Lagrangian is:

$\min L = \frac{1}{2}\|w\|^2 - \sum_i a_i [y_i(w \cdot x_i + b) - 1]$ wrt w, b

We expand the last to get the following L form:

$\min L = \frac{1}{2}\|w\|^2 - \sum_i a_i y_i (w \cdot x_i + b) + \sum_i a_i$ wrt w, b




Lagrangian Formulation 


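So in the SVM problem the Lagrangian is

$\min L_P = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} a_i y_i (x_i \cdot w + b) + \sum_{i=1}^{l} a_i$

s.t. $\forall i,\ a_i \ge 0$, where l is the # of training points

• From the property that the derivatives at min = 0
we get:

$\frac{\partial L_P}{\partial w} = w - \sum_{i=1}^{l} a_i y_i x_i = 0$

$\frac{\partial L_P}{\partial b} = -\sum_{i=1}^{l} a_i y_i = 0$

so

$w = \sum_{i=1}^{l} a_i y_i x_i, \qquad \sum_{i=1}^{l} a_i y_i = 0$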
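
The two stationarity relations can be verified on a toy problem. This is a minimal sketch, assuming two made-up training points and assumed multipliers a = (0.25, 0.25); it builds w from the a's, checks that the a's balance across the classes, and shows that once they balance the value of the Lagrangian no longer depends on b.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(u, v))

def primal_lagrangian(w, b, alphas, X, y):
    """L_P = 1/2 ||w||^2 - sum_i a_i y_i (x_i.w + b) + sum_i a_i."""
    coupled = sum(a * yi * (dot(xi, w) + b) for a, xi, yi in zip(alphas, X, y))
    return 0.5 * dot(w, w) - coupled + sum(alphas)

def weights_from_alphas(alphas, X, y):
    """w = sum_i a_i y_i x_i (from setting dL_P/dw = 0)."""
    dim = len(X[0])
    return [sum(a * yi * xi[k] for a, xi, yi in zip(alphas, X, y))
            for k in range(dim)]

# Hypothetical two-point training set and assumed multipliers.
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [+1, -1]
alphas = [0.25, 0.25]

w = weights_from_alphas(alphas, X, y)
balance = sum(a * yi for a, yi in zip(alphas, y))  # dL_P/db = 0 requires 0

print("w =", w)
print("sum_i a_i y_i =", balance)
print("L_P at b = 0:", primal_lagrangian(w, 0.0, alphas, X, y))
print("L_P at b = 7:", primal_lagrangian(w, 7.0, alphas, X, y))
```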


What’s with this $L_P$ business?

• This indicates that this is the primal form of the 
optimization problem 

• We will actually solve the optimization problem 
by now solving for the dual of this original 
problem 

• What is this dual formulation? 





The Lagrangian Dual Problem: instead of minimizing over w, b,
subject to constraints involving the a’s, we can maximize over a (the
dual variable) subject to the relations obtained previously for w
and b

Our solution must satisfy these two relations:

$w = \sum_{i=1}^{l} a_i y_i x_i, \qquad \sum_{i=1}^{l} a_i y_i = 0$

By substituting for w and b back in the original eqn we can get rid of the
dependence on w and b.

Note first that we already now have our answer for what the weights w
must be: they are a linear combination of the training inputs and the
training outputs, $x_i$ and $y_i$, and the values of a. We will now solve for the
a’s by differentiating the dual problem wrt a, and setting it to zero. Most
of the a’s will turn out to have the value zero. The non-zero a’s will
correspond to the support vectors
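
The substitution step can be written out explicitly. The following is a sketch of the algebra, using only the Lagrangian and the two relations stated above; $L_D$ is the name used here for the resulting dual objective.

```latex
\begin{aligned}
L_P &= \tfrac{1}{2}\, w \cdot w
      \;-\; \sum_{i=1}^{l} a_i y_i (x_i \cdot w)
      \;-\; b \sum_{i=1}^{l} a_i y_i
      \;+\; \sum_{i=1}^{l} a_i \\
% the b term vanishes because \sum_i a_i y_i = 0;
% the second term equals w \cdot w because w = \sum_i a_i y_i x_i
    &= \tfrac{1}{2}\, w \cdot w \;-\; w \cdot w \;+\; \sum_{i=1}^{l} a_i \\
    &= \sum_{i=1}^{l} a_i
      \;-\; \tfrac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l}
            a_i a_j y_i y_j \,(x_i \cdot x_j)
      \;=\; L_D
\end{aligned}
```

Neither w nor b appears in the last line: only the a's, the labels, and the pairwise dot products of the inputs remain.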

The Dual problem

• Kuhn-Tucker theorem: the solution we find here will
be the same as the solution to the original problem

• Q: But why are we doing this???? (why not just
solve the original problem????)

• Ans: Because this will let us solve the problem by
computing just the inner products of $x_i$, $x_j$ (which
will be very important later on when we want to
solve non-linearly separable classification problems)


The Dual Problem 

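Dual problem:

$\max L_D(a_i) = \sum_{i=1}^{l} a_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} a_i a_j y_i y_j (x_i \cdot x_j)$

s.t. $\sum_{i=1}^{l} a_i y_i = 0$ and $a_i \ge 0$

Notice that all we have are the dot products of $x_i$, $x_j$

If we take the derivative wrt a and set it equal to zero,
we get the following solution, so we can solve for $a_i$:

$\sum_{i=1}^{l} a_i y_i = 0$

$0 \le a_i \le C$

(C is an upper bound on the a’s that applies when training errors are
penalized; in the separable case considered so far only $a_i \ge 0$ is required)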
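
To make the dual objective concrete, here is a minimal sketch on an assumed two-point training set. With one point per class, the constraint that the a's balance across the labels forces both multipliers to be equal, so the maximization reduces to a one-dimensional scan. A real solver would use quadratic programming; the grid search is only for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(u, v))

def dual_objective(alphas, X, y):
    """L_D = sum_i a_i - 1/2 sum_i sum_j a_i a_j y_i y_j (x_i . x_j)."""
    n = len(alphas)
    quad = sum(alphas[i] * alphas[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alphas) - 0.5 * quad

# Hypothetical training set: one positive and one negative point.
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [+1, -1]

# sum_i a_i y_i = 0 forces a_1 = a_2 = a here, so scan a single value a >= 0.
candidates = [k / 1000 for k in range(0, 1001)]
best_a = max(candidates, key=lambda a: dual_objective([a, a], X, y))

print("best a:", best_a)
print("L_D at best a:", dual_objective([best_a, best_a], X, y))
```

For this data the objective simplifies to 2a - 4a^2, which peaks at a = 0.25.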






Now knowing the $a_i$ we can find the
weights w for the maximal margin
separating hyperplane:

$w = \sum_{i=1}^{l} a_i y_i x_i$

And now, after training and finding the w by this method,
given an unknown point u measured on features $x_i$ we
can classify it by looking at the sign of:

$f(u) = w \cdot u + b = \left(\sum_{i=1}^{l} a_i y_i\, x_i \cdot u\right) + b$

Remember: most of the weights, i.e., the a’s, will be zero.
Only the support vectors (on the gutters or margin) will have nonzero
weights or a’s - this reduces the dimensionality of the solution
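
The classification rule can be sketched directly from the a's. The example below assumes three made-up training points and assumed multipliers in which the third point has a = 0 (it is not a support vector); the bias is recovered from a support vector using y_s (w . x_s + b) = 1.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(u, v))

def decision_value(u, alphas, X, y, b):
    """f(u) = sum_i a_i y_i (x_i . u) + b; zero multipliers are skipped."""
    return sum(a * yi * dot(xi, u)
               for a, xi, yi in zip(alphas, X, y) if a != 0.0) + b

def bias_from_support_vector(alphas, X, y):
    """Take any support vector s (a_s > 0) and solve y_s (w.x_s + b) = 1 for b.

    Because y_s is +1 or -1, this gives b = y_s - w.x_s.
    """
    s = next(i for i, a in enumerate(alphas) if a > 0.0)
    return y[s] - decision_value(X[s], alphas, X, y, 0.0)

# Hypothetical data; the third point lies well outside the street.
X = [[1.0, 1.0], [-1.0, -1.0], [3.0, 2.0]]
y = [+1, -1, +1]
alphas = [0.25, 0.25, 0.0]  # assumed multipliers, not computed here

b = bias_from_support_vector(alphas, X, y)

def classify(u):
    """Sign of the decision value for an unknown point u."""
    return +1 if decision_value(u, alphas, X, y, b) >= 0 else -1

print("b =", b)
print("class of (2, 0.5):", classify([2.0, 0.5]))
print("class of (-0.5, -2):", classify([-0.5, -2.0]))
```

Only the two points with nonzero a enter the sum, so the third training point could be dropped without changing any prediction.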


Inner products, similarity, and SVMs 

Why should inner product kernels be involved in pattern 
recognition using SVMs, or at all? 

- Intuition is that inner products provide some measure of 
‘similarity’ 

- Inner product in 2D between 2 vectors of unit length returns the
cosine of the angle between them = how ‘far apart’ they are

e.g. $x = [1, 0]^T$, $y = [0, 1]^T$

i.e. if they are parallel their inner product is 1 (completely similar):
$x^T y = x \cdot y = 1$

If they are perpendicular (completely unlike) their inner product is
0 (so should not contribute to the correct classifier), as for the x and y above:

$x^T y = x \cdot y = 0$
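
A short sketch of the inner product as a similarity measure follows. The function name `cosine_similarity` is chosen here for illustration; for unit-length vectors it reduces to the plain dot product.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (the dot product of their unit vectors)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

x = [1.0, 0.0]
y = [0.0, 1.0]

print("parallel (x with itself):", cosine_similarity(x, x))
print("perpendicular (x with y):", cosine_similarity(x, y))
print("45 degrees apart:", cosine_similarity(x, [1.0, 1.0]))
```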

Insight into inner products

Consider that we are trying to maximize the form:

$L_D(a_i) = \sum_{i=1}^{l} a_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} a_i a_j y_i y_j (x_i \cdot x_j)$

s.t. $\sum_{i=1}^{l} a_i y_i = 0$ and $a_i \ge 0$

The claim is that this function will be maximized if we give nonzero values to a’s that
correspond to the support vectors, ie, those that ‘matter’ in fixing the maximum width
margin (‘street’). Well, consider what this looks like. Note first from the constraint
condition that all the a’s are non-negative. Now let’s think about a few cases.

Case 1. If two features $x_i$, $x_j$ are completely dissimilar, their dot product is 0, and they don’t
contribute to L.

Case 2. If two features $x_i$, $x_j$ are completely alike, their dot product is 1. There are 2 subcases.

Subcase 1: both $x_i$ and $x_j$ predict the same output value $y_i$ (either +1 or -1). Then $y_i y_j$
is always 1, and the value of $a_i a_j y_i y_j\, x_i \cdot x_j$ will be positive. But this would decrease the
value of L (since it would subtract from the first term sum). So, the algorithm downgrades
similar feature vectors that make the same prediction.

Subcase 2: $x_i$ and $x_j$ make opposite predictions about the output value $y_i$ (ie, one is
+1, the other -1), but are otherwise very closely similar: then the product $a_i a_j y_i y_j\, x_i \cdot x_j$ is
negative and we are subtracting it, so this adds to the sum, maximizing it. This is precisely
the examples we are looking for: the critical ones that tell the two classes apart.
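
The three cases can be checked by computing what a single pair (i, j) contributes to the objective. This is a minimal sketch with made-up unit vectors and an assumed multiplier value of 0.5 for both points; `pair_contribution` is a name introduced here for the term that the pair adds to L.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(u, v))

def pair_contribution(a_i, a_j, y_i, y_j, x_i, x_j):
    """Amount the pair (i, j) adds to L: -1/2 a_i a_j y_i y_j (x_i . x_j)."""
    return -0.5 * a_i * a_j * y_i * y_j * dot(x_i, x_j)

a = 0.5  # assumed multiplier value, for illustration only

# Case 1: completely dissimilar (perpendicular) vectors contribute nothing.
case_1 = pair_contribution(a, a, +1, +1, [1.0, 0.0], [0.0, 1.0])

# Subcase 1: identical vectors, same label: the pair lowers L.
subcase_1 = pair_contribution(a, a, +1, +1, [1.0, 0.0], [1.0, 0.0])

# Subcase 2: identical vectors, opposite labels: the pair raises L.
subcase_2 = pair_contribution(a, a, +1, -1, [1.0, 0.0], [1.0, 0.0])

print(case_1, subcase_1, subcase_2)
```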


Insight into inner products, graphically: 2 very
very similar $x_i$, $x_j$ vectors that predict different
classes tend to maximize the margin width

2 vectors that are similar but predict the
same class are redundant











But... are we done???















Problems with linear SVM 



What if the decision function is not linear? What transform would separate these? 


Ans: polar coordinates! 
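
The polar-coordinate idea can be sketched as follows, assuming the data form two concentric rings (an inner ring for class -1 and an outer ring for class +1). The ring radii and the threshold of 2.0 are made-up values for illustration. No straight line separates the rings in (x, y), but a single threshold on the radius does.

```python
import math

def to_polar(x, y):
    """Map Cartesian (x, y) to polar (r, theta)."""
    return (math.hypot(x, y), math.atan2(y, x))

def ring(radius, n):
    """n evenly spaced points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

# Hypothetical data: inner ring is class -1, outer ring is class +1.
inner = ring(1.0, 8)
outer = ring(3.0, 8)

def classify_by_radius(point, threshold=2.0):
    """In polar coordinates the boundary is the straight line r = threshold."""
    r, _theta = to_polar(*point)
    return +1 if r > threshold else -1

print([classify_by_radius(p) for p in inner])
print([classify_by_radius(p) for p in outer])
```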

Non-linear SVM

The Kernel trick

Imagine a function $\phi$ that maps the data into another space (here, a radial map).

[Figure: radial map $\phi$ applied to two classes of points, labeled -1 and +1.]

Remember the function we want to optimize: $L_D = \sum_i a_i - \frac{1}{2}\sum_i\sum_j a_i a_j y_i y_j (x_i \cdot x_j)$ where $(x_i \cdot x_j)$ is the
dot product of the two feature vectors. If we now transform to $\phi$, instead of computing this
dot product $(x_i \cdot x_j)$ we will have to compute $(\phi(x_i) \cdot \phi(x_j))$. But how can we do this? This is
expensive and time consuming (suppose $\phi$ is a quartic polynomial... or worse, we don’t know the
function explicitly). Well, here is the neat thing:

If there is a “kernel function” K such that $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$, then we do not need to know
or compute $\phi$ at all!! That is, the kernel function defines inner products in the transformed space.
Or, it defines similarity in the transformed space.

Non-linear SVMs

So, the function we end up optimizing is:

$L_D = \sum_i a_i - \frac{1}{2}\sum_i\sum_j a_i a_j y_i y_j K(x_i, x_j)$

Kernel example: The polynomial kernel

$K(x_i, x_j) = (x_i \cdot x_j + 1)^p$, where p is a tunable parameter

Note: Evaluating K only requires one addition and one exponentiation
more than the original dot product
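
To see that the polynomial kernel really is an inner product in a transformed space, the sketch below takes p = 2 and 2-D inputs and compares the kernel value with an explicit dot product. The 6-dimensional feature map `phi` is one standard choice for this case and is written out here as an assumption for illustration; the two test vectors are made up.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(u, v))

def poly_kernel(x, y, p=2):
    """Polynomial kernel K(x, y) = (x . y + 1)^p."""
    return (dot(x, y) + 1) ** p

def phi(x):
    """Explicit feature map for p = 2 and 2-D input (maps into 6 dimensions)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2]

x = [1.0, 2.0]
y = [3.0, -1.0]

print("kernel value:        ", poly_kernel(x, y))
print("phi(x) . phi(y):     ", dot(phi(x), phi(y)))
```

The kernel needs one 2-D dot product, one addition, and one power; the explicit route needs two 6-D feature vectors first.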


Examples for Non Linear SVMs 

$K(x, y) = (x \cdot y + 1)^p$

$K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$

$K(x, y) = \tanh(\kappa\, x \cdot y - \delta)$

1st is polynomial (includes $x \cdot x$ as special case)
2nd is radial basis function (gaussians)
3rd is sigmoid (neural net activation function)
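
The three kernels can be written as small functions. This is a minimal sketch; the parameter names `p`, `sigma`, `kappa`, and `delta` and their default values are choices made here for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, y, p=2):
    """K(x, y) = (x . y + 1)^p."""
    return (dot(x, y) + 1) ** p

def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    """K(x, y) = tanh(kappa * (x . y) - delta)."""
    return math.tanh(kappa * dot(x, y) - delta)

u = [1.0, 2.0]
v = [3.0, -1.0]

print("polynomial:", polynomial_kernel(u, v, p=3))
print("rbf:       ", rbf_kernel(u, v, sigma=2.0))
print("sigmoid:   ", sigmoid_kernel(u, v, kappa=0.5, delta=0.1))
```

Each function takes two input vectors and returns a single similarity value, so any of them can stand in for the dot product in the dual objective.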

We’ve already seen such nonlinear
transforms...

• What is it???

• $\tanh(\beta_0 x^T x_i + \beta_1)$

• It’s the sigmoid
transform (for neural
nets)

• So, SVMs subsume
neural nets! (but w/o
their problems...)



Inner Product Kernels 


Type of Support Vector Machine | Inner Product Kernel $K(x, x_i)$, $i = 1, 2, \ldots, N$ | Usual inner product
Polynomial learning machine | $(x^T x_i + 1)^p$ | Power p is specified a priori by the user
Radial-basis function (RBF) | $\exp\left(-\frac{1}{2\sigma^2} \lVert x - x_i \rVert^2\right)$ | The width $\sigma^2$ is specified a priori
Two layer neural net | $\tanh(\beta_0 x^T x_i + \beta_1)$ | Actually works only for some values of $\beta_0$ and $\beta_1$






Kernels generalize the notion of ‘inner 
product similarity’ 

Note that one can define kernels over more than just 
vectors: strings, trees, structures, ... in fact, just about 
anything 

A very powerful idea: used in comparing DNA, protein 
structure, sentence structures, etc. 





Nonlinear rbf kernel 



Admiral’s delight w/ difft kernel 
functions 


















Overfitting by SVM 



Every point is a support vector... too much freedom to bend to fit the
training data - no generalization.

In fact, SVMs have an ‘automatic’ way to avoid such issues, but we 
won’t cover it here... see the book by Vapnik, 1995. (We add a 
penalty function for mistakes made after training by over-fitting: recall 
that if one over-fits, then one will tend to make errors on new data. 
This penalty fn can be put into the quadratic programming problem 
directly. You don’t need to know this for this course.) 